Getting Started: R and BigQuery

This slide deck was built in Quarto!

  • Use keyboard arrow keys to
    • advance ( → ) and
    • go back ( ← )
  • Type “s” to see speaker notes
  • Type “?” to see other keyboard shortcuts

Itinerary for Second Hour

  • Exploring Data
  • SQL Refresher (or SQL 101)
  • Querying in BigQuery

Exploring Data

You can click on a table’s name to open it to the right. Take a look at the first three tabs available, which will be available for every table.

  • Schema (what the columns are)
  • Details (how big it is, other metadata)
  • Preview (very important because FREE)

Data Exploration Exercise

Look through the tables in the Broadstreet ADI data and the CDC Births data to understand what the data is capturing, and identify the following:

  • One or more fields that has linking information in common, in that the data appears in both datasets (might not have the same field name) – we’ll call this our “join” data
  • One or more columns / fields in the ADI data that help characterize socioeconomic disadvantage and are in the same table as our join data
  • One or more columns / fields in the CDC data that help describe something about birth statistics and are in the same table as our join data
05:00

SQL Refresher

This is the pattern we’ll come back to again and again in SQL (Structured Query Language):

SELECT + FROM + WHERE;

cabin high_score wins losses
Yellow 10 3 0
Blue 7 2 1
Orange 9 1 2

For example: SELECT cabin, high_score FROM camp_kickball_tournament WHERE high_score < 10

Aliases (AS)

Aliases allow you to (re)name things:

SELECT 
  cabin, 
  high_score, 
  wins/(wins + losses) AS prop_win   
FROM camp_kickball_tournament 
WHERE high_score < 10;

Whitespace / Capitalization

Which is easier to read?

SELECT 
  cabin, 
  high_score, 
  wins/(wins + losses) AS prop_win 
FROM camp_kickball_tournament 
WHERE high_score < 10;
select cabin,high_score,wins/(wins + losses) as prop_win from camp_kickball_tournament where high_score < 10;
  • Be consistent (reduce cognitive load on teammates / future you)
  • Capitalize key words so they stand out
  • Indent the parts of your query
  • Consider using a new line for every field

Query Button

Use the “Query” button (or the one with the magnifying glass) when you’re looking within a table to avoid annoying syntax issues like having to type fully qualified table names (project.dataset.table).

Queries in BigQuery

Once you click “Query Table” (and then “New Tab” or whatever you prefer), you’ll see a partial query, with the cursor located so that you can add fields after SELECT.

You’ll also see warnings that your query is invalid. That’s expected!

Your First Query

  • You might have to adjust your pane widths to see all the buttons you need.
  • Please click on the table in the CDC data (sdoh_cdc_wonder_natality) called county_natality.
  • Click on the “query” button and choose “In split tab”.
  • Add one or more fields from the schema of the table between the words “SELECT” and “FROM” in the query.
  • When you get a green check mark showing that your query is valid, click the blue “Run” button above your query.

03:00

Using Gemini to Learn SQL

Let’s use the AI assistant to learn some SQL!

JOINs in SQL

I asked you before to find three things, and now we’ll use them in a JOIN in SQL.

  • One or more fields that has linking information in common, in that the data appears in both datasets
  • One or more columns / fields in the ADI data that help characterize socioeconomic disadvantage and are in the same table as our join data
  • One or more columns / fields in the CDC data that help describe something about birth statistics and are in the same table as our join data

Joins – things to consider

What to join?

  • What two tables give us a fuller picture of our data, when used together?
  • What fields from each table will we include?

What constitutes a join?

  • What needs to happen for rows to be a “match”?
  • Which linking field(s) are involved?

Which type of join / data completion

  • Which part of the Venn Diagram of table overlap do we care about? Things that show up in only one table? Or in both?

Join Types

INNER JOIN (or just JOIN)

LEFT (or LEFT OUTER) JOIN

RIGHT (or RIGHT OUTER) JOIN

FULL (or FULL OUTER) JOIN

Our example: ADI and Birth Characteristics

  • What data do we want? Information about socioeconomic disadvantage (ADI tables) and how that relates to birth characteristics (CDC tables)
  • What will link the data from the two sources? The county has to match (either the name or the FIPS code or both)
  • What kind of data completion do we want? Only data where we have both ADI data and CDC data about the county.

Left, Right

Syntax of a JOIN

The basic syntax of a join is:

SELECT [fields we want]
FROM [left table name] [some join type] [right table name]
ON (or USING) [join criteria]

A Sample JOIN

JOIN Complications

We have multiples of counties, with varying ADI scores, in our query results. That’s not great – we want each county to appear only once.

What’s going on? Dates.

A First Try With Gemini

Gemini gets us the start of a query, but it won’t quite work. But it does give you a nice look at aliasing!

Riffing on Gemini

This is what Gemini gave me:

SELECT
    t1.Ave_Birth_Weight_gms,
    t2.area_deprivation_index_percent
  FROM
    `bigquery-public-data.sdoh_cdc_wonder_natality.county_natality` AS t1
    INNER JOIN `bigquery-public-data.broadstreet_adi.area_deprivation_index_by_county` AS t2 ON t1.County_of_Residence_FIPS = t2.county_fips_code
     AND t1.Year = t2.year

Error!

No matching signature for operator = for argument types: DATE, INT64. Supported signature: ANY = ANY at [7:10]

Extracting part of a date

DATE is a type that has year, day, and month. But if I just have a four digit year, that’s an integer (INT64).

I can’t compare an INT column that has “2019” to a DATE column that has “2019-01-01”

Googling SQL

Include what you’re trying to do and “in BigQuery.”

In our case, EXTRACT seems likely to work.

Tweaking What Gemini Gave Us

SELECT
  cdc.Ave_Birth_Weight_gms,
  adi.area_deprivation_index_percent
FROM
  `bigquery-public-data.sdoh_cdc_wonder_natality.county_natality` AS cdc 
  INNER JOIN `bigquery-public-data.broadstreet_adi.area_deprivation_index_by_county` AS adi
ON cdc.County_of_Residence_FIPS = adi.county_fips_code AND 
   EXTRACT(YEAR FROM cdc.Year) = adi.year
   

Sanity Checks

SELECT
  cdc.Ave_Birth_Weight_gms,
  adi.area_deprivation_index_percent,
  cdc.County_of_Residence,
  adi.county_name,
  cdc.Year,
  adi.year
FROM
  `bigquery-public-data.sdoh_cdc_wonder_natality.county_natality` AS cdc 
  INNER JOIN `bigquery-public-data.broadstreet_adi.area_deprivation_index_by_county` AS adi
ON cdc.County_of_Residence_FIPS = adi.county_fips_code AND 
   EXTRACT(YEAR FROM cdc.Year) = adi.year

Grouping and Aggregation

One last thing – we now are matching on year, but what if we have data on 2018, 2019, and 2020 for a given county, in both tables? We’ll match that data up properly, but some counties might be over-represented in our analysis.

Saving Your Work: Tables

In your Query results, you can save your work. In our case, we want to save it as a BigQuery table, in our project.

  • Create a new dataset (single region) with a good name
  • Save the table in that dataset with a good name

Create a Dataset

Create table

Saving Your Work: Queries

Look at Your Resources

Break

Let’s take a break and then come back to work in R!

10:00